core: fix unrecoverable freezes of rabbit's consumer #10594

Open
wants to merge 3 commits into dev from peb/core/fix_core_freeze_on_rabbit_shutdown

Conversation

bougue-pe
Contributor

@bougue-pe bougue-pe commented Jan 30, 2025

Hard fix (kind of): kill the process, and let the orchestrator restart it, if a thread's exception goes up to main() or if rabbit's cancel/shutdown notification is received (even if other threads are still running).

Bump amqp-client on the way as it doesn't hurt.

Fixes #8621
Also reproduced and fixed the case that leads to the following (different) core logs:

[11:16:04,880] [INFO]          [WorkerCommand] consume shutdown: amq.ctag-LsXBfmFdL6n758icthlCYA, com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=320, reply-text=CONNECTION_FORCED - broker forced connection closure with reason 'shutdown', class-id=0, method-id=0)
[11:16:04,883] [WARN]  [ForgivingExceptionHandler] An unexpected connection driver error occurred (Exception message: Connection reset by peer)
Exception in thread "main" com.rabbitmq.client.AlreadyClosedException: connection is already closed due to connection error; protocol method: #method<connection.close>(reply-code=320, reply-text=CONNECTION_FORCED - broker forced connection closure with reason 'shutdown', class-id=0, method-id=0)
        at com.rabbitmq.client.impl.AMQConnection.startShutdown(AMQConnection.java:1012)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1127)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1056)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1040)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1032)
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.close(AutorecoveringConnection.java:289)
        at kotlin.io.CloseableKt.closeFinally(Closeable.kt:56)
        at fr.sncf.osrd.cli.WorkerCommand.run(WorkerCommand.kt:319)
        at fr.sncf.osrd.App.main(App.java:44)

Hand-tested

  • Run classic (one per infra) core with rabbitmq up, without editoast to load infra from: ✅core crashes.
  • Run single-worker core with the whole stack running: ✅core works and OSRD does its job.
  • Run single-worker core with the whole stack, then stop rabbitmq: ✅core exits (on shutdown notification).
  • Run full stack single-worker mode, then remove the core-req-all queue to trigger a cancel notification: ✅core logs "consumer cancelled" then stops (should cover #8621, "Core message consumer fails and blocks the scenario").
  • Run full stack + single-worker editoast (no core), initiate some core requests (using front).
    • Then stop editoast and start single-worker core (so impossible to load infra inside DeliverCallback): ✅core crashes and releases unacked messages.
    • Then start editoast, then start core: ✅core works and OSRD does its job.
    • Then stop editoast and start single-worker core (so impossible to load infra inside DeliverCallback): ✔️❓Exceptions in threads when trying to load infra for pending requests, core stays alive (leaving pending requests unacked until core is stopped - after step below).
      ➡️We can try/catch in callback function and exitProcess to force shutdown.
      DONE in last commit.
    • Then start editoast (keep core as-is) and initiate some new core request (using front): ✅core "correctly" processing only new requests.
    • Stop core: ✅Unacked requests are back to ready.
    • Start core: ✅Ready requests are processed.
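
The "try/catch in callback function and exitProcess" step above can be sketched roughly as follows. This is a minimal stand-alone illustration, not the code of this PR: `FailFastCallback`, `wrap`, and the injected `exit` hook are hypothetical names, and a plain `Consumer<String>` stands in for the RabbitMQ `DeliverCallback` so that the snippet needs no dependency.

```java
import java.util.function.Consumer;
import java.util.function.IntConsumer;

/** Illustrative only: these names are not taken from WorkerCommand.kt. */
final class FailFastCallback {
    private FailFastCallback() {}

    /**
     * Wraps a message handler so that any failure triggers the exit hook
     * instead of leaving the consumer alive with unacked messages.
     */
    static Consumer<String> wrap(Consumer<String> handler, IntConsumer exit) {
        return message -> {
            try {
                handler.accept(message);
            } catch (Throwable t) {
                System.err.println("delivery handler failed, exiting: " + t);
                exit.accept(1); // production code would pass System::exit here
            }
        };
    }
}
```

Injecting the exit hook keeps the wrapper testable. In a real worker the hook would presumably be `System::exit` (or `exitProcess` on the Kotlin side), so that the broker sees the connection drop and requeues the unacked messages.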

Understanding of previous and current work

Previous:

Current:

  • ShutdownCallback looks like a thread join (triggering final return) and it may be an idea to use shutdown notification to exit cleanly
  • The thread executors are messing with the handling of exceptions
  • Looks like the CancelCallback doesn't trigger a "join" or a shutdown process (because of threads? or maybe because it's up to the implementation to decide what's graceful?)
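
To illustrate the point about executors and exception handling: with the standard `java.util.concurrent` API, an exception thrown inside a task passed to `submit()` is stored in the returned `Future` and only surfaces when someone calls `get()`. It never reaches main() on its own. The sketch below is generic; `ExecutorFailure` and `failureOf` are hypothetical names, and whether WorkerCommand uses `submit()` or `execute()` is an assumption not checked here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: shows where a task's exception ends up. */
final class ExecutorFailure {
    private ExecutorFailure() {}

    /** Runs the task on a worker thread and returns its failure, or null on success. */
    static Throwable failureOf(Callable<?> task) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> result = pool.submit(task);
            result.get(); // the task's exception is only rethrown here, wrapped
            return null;
        } catch (ExecutionException e) {
            return e.getCause(); // the original exception thrown by the task
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        } finally {
            pool.shutdown();
        }
    }
}
```

If nothing ever calls `get()` on the future, the failure goes unnoticed, which is consistent with a worker that stays alive after its tasks have thrown.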

Looks like some improvements could be made (to be explored later, sorted by ROI):

  • Improve, or make more explicit, some of the work done in "core: improve worker lifetime" #9439 (use isRecovarable, clean up consumer logs, properly close channels and connections)
  • Avoid sharing channels between threads, as stated in the dedicated documentation
  • Handle shutdown (on exceptions or on notification) in a more standard way, by issuing a shutdown notification or by sharing a signal (or a common variable?)
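
One possible shape for the "shared signal" idea in the last bullet, as a sketch only: callbacks and worker threads complete a shared future with an exit code, and main() blocks on it. `ShutdownSignal`, `request`, and `await` are hypothetical names; nothing like this exists in the PR.

```java
import java.util.concurrent.CompletableFuture;

/** Illustrative only: a one-shot exit-code signal shared between threads. */
final class ShutdownSignal {
    private final CompletableFuture<Integer> exitCode = new CompletableFuture<>();

    /** Callable from any thread; returns false if shutdown was already requested. */
    boolean request(int code) {
        return exitCode.complete(code); // the first request wins
    }

    /** Blocks the calling thread until some thread requests shutdown. */
    int await() {
        return exitCode.join();
    }
}
```

Under this design, main() could end with `System.exit(signal.await())`, while the shutdown, cancel, and delivery callbacks would each call `signal.request(...)` rather than exiting on their own.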

@bougue-pe bougue-pe requested review from Khoyo and ElysaSrc January 30, 2025 10:16
@bougue-pe bougue-pe requested a review from a team as a code owner January 30, 2025 10:16
@github-actions github-actions bot added the area:core Work on Core Service label Jan 30, 2025
@codecov-commenter

codecov-commenter commented Jan 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.93%. Comparing base (b3a6f01) to head (e6c5f8c).
Report is 2 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev   #10594      +/-   ##
==========================================
- Coverage   81.93%   81.93%   -0.01%     
==========================================
  Files        1079     1079              
  Lines      107380   107376       -4     
  Branches      737      737              
==========================================
- Hits        87984    87978       -6     
- Misses      19356    19358       +2     
  Partials       40       40              
Flag Coverage Δ
editoast 74.28% <ø> (-0.01%) ⬇️
front 89.47% <ø> (-0.01%) ⬇️
gateway 2.18% <ø> (ø)
osrdyne 3.28% <ø> (ø)
railjson_generator 87.50% <ø> (ø)
tests 88.14% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

@@ -316,7 +320,7 @@ class WorkerCommand : CliCommand {
if (!channel.isOpen()) break
}

return 0
Contributor

How does this work out with multithreading?

I don't quite understand when this line is actually reached. Is it when all threads have died, or just one?

Maybe some time we should rewrite this class to lower the amount of nested callbacks and functions. It's a little difficult to follow.

Contributor Author

Well, I don't understand all of it either.

After quite a few hand-tests and trying various things, I landed on a different proposal: please take a look at the (updated) main comment of the PR (which may be updated again later, in which case I will notify).

@bougue-pe bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch from 06726c5 to ee9d8aa Compare January 30, 2025 10:42
@bougue-pe bougue-pe self-assigned this Feb 5, 2025
@bougue-pe bougue-pe marked this pull request as draft February 5, 2025 07:14
@bougue-pe bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch 2 times, most recently from b9647fc to c8bd86f Compare February 5, 2025 15:31
@bougue-pe bougue-pe requested a review from woshilapin February 5, 2025 17:49
@bougue-pe bougue-pe marked this pull request as ready for review February 5, 2025 17:49
@bougue-pe bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch 2 times, most recently from 54ad82d to 17f2e86 Compare February 6, 2025 14:57
Hard fix (kind of) to kill the process (and let orchestrator restart) if:
* rabbit shuts down (triggering consumer's ShutdownCallback)
* or an exception before starting basicConsume() goes up to main()
  (even if other threads may run)

Bump amqp-client on the way as it doesn't hurt

Signed-off-by: Pierre-Etienne Bougué <[email protected]>
From hand-tests, shutdown is already covered by the System.exit in
App.java::main().

Signed-off-by: Pierre-Etienne Bougué <[email protected]>
…llback

Hard fix (kind of) to kill the process (and let orchestrator restart) if an
exception goes all the way up to the DeliverCallback.
For example when not able to reach editoast for infra reload.
This will release unacked messages and move them back to ready (instead
of keeping them unacked until the worker exits).

Signed-off-by: Pierre-Etienne Bougué <[email protected]>
@bougue-pe bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch from 3ecda5e to e6c5f8c Compare February 6, 2025 15:37
@bougue-pe bougue-pe changed the title core: fix unrecoverable freeze when rabbit shuts down core: fix unrecoverable freezes of rabbit's consumer Feb 6, 2025
@bougue-pe
Contributor Author

A third commit was pushed, and the main comment updated (and all rebased on dev).

No more work is planned on this, please read the main comment, any feedback is welcome, and we should be good to go 🙏

@bougue-pe
Contributor Author

bougue-pe commented Feb 6, 2025

There is a bit more work, actually (for me): test if it improves the case described in #10704
EDIT: Looks like a different issue, not investigating more on it.

Labels
area:core Work on Core Service
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Core message consumer fails and blocks the scenario
4 participants